This Notebook has been released under the Apache 2.0 open source license.
Input: 93.67 KB
Data Sources

  • Titanic: Machine Learning from Disaster
  • titanic leaked
Titanic: Machine Learning from Disaster
Start here! Predict survival on the Titanic and get familiar with ML basics

Last Updated: 8 years ago

About this Competition

Overview

The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
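The all-and-only-females rule behind gender_submission.csv can be sketched in a few lines of pandas. The miniature DataFrame below is a hypothetical stand-in for test.csv (the real file has 418 rows and more columns):

```python
import pandas as pd

# Hypothetical three-row stand-in for test.csv; the real file also has
# Pclass, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked columns.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# gender_submission.csv logic: predict survival for all and only females.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})

# submission.to_csv("gender_submission.csv", index=False)
```

Any valid submission file follows this same two-column shape: one row per test passenger, with a 0/1 Survived prediction.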

Data Dictionary

Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
sex        Sex
age        Age in years
sibsp      # of siblings / spouses aboard the Titanic
parch      # of parents / children aboard the Titanic
ticket     Ticket number
fare       Passenger fare
cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
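As a hedged illustration of the feature engineering mentioned in the overview, sibsp and parch combine naturally into a family-size feature. The DataFrame below is a hypothetical sample; in the actual CSVs the columns are capitalized as SibSp and Parch:

```python
import pandas as pd

# Hypothetical sample rows; real train.csv/test.csv name these SibSp and Parch.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Family size = the passenger plus siblings/spouses plus parents/children.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Solo travellers (note: children travelling only with a nanny have Parch = 0,
# so they can appear "alone" even though they were accompanied).
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
```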

Output (5.54 KB)

  • submission1.csv (2.77 KB)
  • submission2.csv

submission1.csv preview:
PassengerId,Survived
892,0
893,1
894,0
895,0
896,1
897,1
898,0
899,1
900,1
901,0
902,0
903,0
904,1
905,0
906,1
907,1
908,0
909,0
910,0
911,1
912,0
913,1
914,1
915,1
916,1
917,0
918,1
919,0
920,0
921,0
922,0
923,0
924,1
925,0
926,1
927,0
928,1
929,0
930,1
931,1
932,1
933,0
934,0
935,0
936,1
937,0
938,1
939,0
940,1
Execution Info

Succeeded: True
Exit Code: 0
Used All Space: False
Run Time: 29.5 seconds
Timeout Exceeded: False
Output Size: 5.54 KB
Accelerator: None
Timeline Log

6.0s   [NbConvertApp] Converting notebook __notebook__.ipynb to notebook
6.2s   [NbConvertApp] Executing notebook with kernel: python3
25.1s  [NbConvertApp] Writing 259608 bytes to __notebook__.ipynb
28.0s  [NbConvertApp] Converting notebook __notebook__.ipynb to html
29.4s  [NbConvertApp] Support files will be in __results___files/
29.4s  [NbConvertApp] Making directory __results___files (repeated 6 times)
29.4s  [NbConvertApp] Writing 436574 bytes to __results__.html
29.4s  Complete. Exited with code 0.
Comments (25)
tomzomac (Contributor) · Latest Version · a month ago

Thank you for this very informative and humorous kernel. Upvoted 👍

soham mukherjee (Expert, Kernel Author) · Latest Version · a month ago

thanks a lot @tomzomac

Niraj (Expert) · Latest Version · a month ago

Upvoted!!

soham mukherjee (Expert, Kernel Author) · Latest Version · a month ago

thanks a lot @nirajpoudel

Krutarth Darji (Master) · Latest Version · a month ago

Great kernel :)

soham mukherjee (Expert, Kernel Author) · Latest Version · a month ago

thanks @krutarthhd

Maxence Lemercier (Contributor) · Latest Version · a month ago

Thank you very much for this! Lots of great insights in terms of Exploratory Data Analysis. If I may (although you probably already know this one since you mention the tendency of RandomForest not to overfit from training set): you use the entire training set to train your models (which is fine), but you measure the model's "score" based on data used in training to compare models. Wouldn't it be preferable to use unseen data (or, at least, cross-validation) in order to know how the models will behave on the test set?

soham mukherjee (Expert, Kernel Author) · Latest Version · a month ago

True, but we have to select only one model, right? So by scoring on the training dataset I got to know which model works on this dataset. Otherwise, when using several different models, it's better to use ensembling, although that is not proper learning, just a way to climb the leaderboard. Also, in this task you can run all of these models, collect all of their results, and then treat each model's prediction for each ID as a vote:
if, for a particular ID, the number of 1s exceeds the number of 0s, you select 1 as your final answer.
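The voting scheme described in this comment can be sketched with NumPy. The prediction matrix here is hypothetical, standing in for the 0/1 outputs of the trained models:

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five passenger IDs.
preds = np.array([
    [1, 0, 1, 0, 1],  # model A
    [1, 1, 1, 0, 0],  # model B
    [0, 0, 1, 1, 1],  # model C
])

# Majority vote: a passenger gets 1 when more models predict 1 than 0.
votes = preds.sum(axis=0)
final = (votes * 2 > preds.shape[0]).astype(int)
```

With an odd number of models there are no ties; scikit-learn's VotingClassifier implements the same idea more generally.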

Maxence Lemercier (Contributor) · Latest Version · a month ago

Thank you, those are good insights on how to easily use an ensemble method (I hadn't thought of the voting system).
Regarding the earlier question: would it be preferable to split the original training set into train/validation sets and compare how each model performs on the validation set (and even use the validation set for early stopping in an ANN), OR use the entire training set to benefit from all the data (since it is quite a small dataset to begin with)?

soham mukherjee (Expert, Kernel Author) · Latest Version · a month ago

@maxencelemercier sorry for this late reply. Actually, the score is calculated by that process: as you can see, all models are trained on the whole dataset, but the score is calculated by splitting the dataset 80:20. If you want to change that, you can of course use sklearn.model_selection.train_test_split. You can also build a validation matrix to find the proper accuracy; it will be introduced in the next version of this notebook.
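The 80:20 evaluation discussed in this thread can be sketched as follows. The data here is a synthetic stand-in for the engineered Titanic features, and DecisionTreeClassifier is just an arbitrary example model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic toy data standing in for the engineered Titanic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

# Hold out 20% so the score reflects unseen data rather than memorization.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
holdout_score = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr).score(X_val, y_val)

# Cross-validation averages the score over several splits, which is more
# reliable on a dataset as small as Titanic's 891 training rows.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5)
```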

Taikutsu (Contributor) · Version 5 of 6 · a month ago

Good kernel, Natsu: knowledge and humor.

soham mukherjee (Expert, Kernel Author) · Version 5 of 6 · a month ago

Thanks @ravels1991

Marília Prata (Expert) · Version 5 of 6 · a month ago

Very good Notebook, charts and Dataset. Great sense of humor.

soham mukherjee (Expert, Kernel Author) · Version 5 of 6 · a month ago

Thanks man!

G Ranjith kumar (Expert) · Version 5 of 6 · a month ago

Nice work:)

soham mukherjee (Expert, Kernel Author) · Version 5 of 6 · a month ago
Vadim Sokolov (Contributor) · Latest Version · 4 days ago

Excellent work! Thank you for the humorous memes. It was very funny 😀

Saurabh Pant983 (Novice) · Latest Version · 4 days ago

I just started doing ML 45 days ago; am I supposed to know all this coding? I can understand all of it, but when I compare my model with yours, yours is far more complex: you imputed missing values with great care, whereas I just imputed the mode, and my model's univariate and bivariate analysis is weak.
Would you mind recommending a course I can join? I have already completed Udemy's Machine Learning A-Z course.
Thanks

RUDRA PRATAP SIGNH (Contributor) · Latest Version · 4 days ago

Great work

tejas shyam (Contributor) · Latest Version · 4 days ago

Amazing :)

Siva Chinchalapu (Novice) · Latest Version · 9 days ago

Thanks for the share

Michael Moroz (Contributor) · Latest Version · 10 days ago

@soham1024 hi!👋

Thank you for the inspiring research and solution; it was a pleasure to read and apply in my work! 👍
Have you considered features such as 'Lucky ticket' (some ticket numbers are luckier than others) and 'Cabin location' in your following entries?

Thank you!

Pratik Bhangale (Novice) · Latest Version · 15 days ago

Hello, I drew the pointplot for Embarked. You showed that for Embarked C, males' survival percentage is higher, but when I looked at the data I found that this is not true. Please check the image, and correct me if I am wrong. Nice code.

I got the plot shown below:

G_R_S (Expert) · Version 5 of 6 · a month ago

Hey Natsu! How's it going with Lucy?? Nice meme kernel, by the way!

soham mukherjee (Expert, Kernel Author) · Version 5 of 6 · a month ago

Thanks @danoozy44